Image and Video Processing for Affective Applications
M. Pantic and G. Caridakis

M. Pantic: Department of Computing, Imperial College London, UK; Faculty of Electrical Engineering, Mathematics and Computer Science, University of Twente, Enschede, The Netherlands; e-mail: [email protected]

Abstract
Recent advances in affective computing have broadened the range of applications of its findings; at the same time, as the state of the art advances, related research areas (computer vision, pattern recognition, etc.) encounter new image- and video-processing challenges tied to the task of automatic affective analysis. Although humans cope relatively easily with the task of perceiving facial expressions, gestural expressivity, and other visual cues involved in expressing emotion, the automatic counterpart of the task is far from trivial. This chapter summarizes current research efforts in solving these problems and enumerates the scientific and engineering issues that arise in meeting these challenges toward emotion-aware systems.

1 The Problem Domain

Because of its practical importance and the theoretical interest of cognitive and medical scientists (Ekman et al., 2002; Pantic, 2005; Chang et al., 2006), machine analysis of facial expressions has attracted the interest of many researchers. For exhaustive surveys of the related work, readers are referred to Samal and Iyengar (1992) for an overview of early works, to Tian et al. (2005) and Pantic and Bartlett (2007) for surveys of techniques for detecting facial muscle actions, and to Pantic and Rothkrantz (2000) for surveys of facial affect recognition methods. However, although humans detect and analyze faces and facial expressions in a scene with little or no effort, developing an automated system that accomplishes this task is rather difficult.

1.1 Level of Description: Action Units and Emotions

Two main streams in the current research on automatic analysis of facial expressions consider facial affect (emotion) detection and facial muscle action (action unit) detection. These two streams stem directly from the two major approaches to facial expression measurement in psychological research (Cohen, 2006): message judgment and sign judgment. The aim of message judgment is to infer what underlies a displayed facial expression, such as affect or personality, while the aim of sign judgment is to describe the "surface" of the shown behavior, such as facial movement or facial component shape. Thus, a brow furrow can be judged as "anger" (Ekman, 2003; Kapoor et al., 2003) in a message-judgment approach and as a facial movement that lowers and pulls the eyebrows closer together in a sign-judgment approach. While message judgment is all about interpretation, sign judgment attempts to be objective, leaving inference about the conveyed message to higher-order decision making.

FACS (Ekman and Friesen, 1969, 1978) provides an objective and comprehensive language for describing facial expressions and relating them back to what is known about their meaning from the behavioral science literature. Because it is comprehensive, FACS also allows for the discovery of new patterns related to emotional or situational states. For example, what are the facial behaviors associated with driver fatigue? What are the facial behaviors associated with states that are critical for automated tutoring systems, such as interest, boredom, confusion, or comprehension?
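To make the sign- versus message-judgment distinction concrete, the sketch below encodes a handful of FACS action units and a simplified, EMFACS-style mapping from AU combinations to prototypical emotion labels. The AU names follow FACS, but the emotion prototypes are illustrative abbreviations of published mappings, not a complete or authoritative coding scheme.

```python
# Illustrative sketch: sign judgment (AU codes) vs. message judgment (labels).
# AU names follow FACS; the emotion->AU mappings are simplified, EMFACS-style
# prototypes given for illustration only.

AU_NAMES = {
    1: "inner brow raiser", 2: "outer brow raiser", 4: "brow lowerer",
    5: "upper lid raiser", 6: "cheek raiser", 12: "lip corner puller",
    15: "lip corner depressor", 26: "jaw drop",
}

# Simplified prototypical AU combinations for a few basic emotions.
EMOTION_PROTOTYPES = {
    "happiness": {6, 12},
    "sadness": {1, 4, 15},
    "surprise": {1, 2, 5, 26},
}

def sign_judgment(active_aus):
    """Describe only the surface behavior: which muscle actions occurred."""
    return [f"AU{au} ({AU_NAMES.get(au, 'unknown')})" for au in sorted(active_aus)]

def message_judgment(active_aus):
    """Infer what the display may convey: match against emotion prototypes."""
    return [emo for emo, proto in EMOTION_PROTOTYPES.items()
            if proto <= set(active_aus)]

if __name__ == "__main__":
    observed = {6, 12}                 # e.g., a smile with cheek raising
    print(sign_judgment(observed))     # objective description of the display
    print(message_judgment(observed))  # ['happiness'] -- an interpretation
```

The point is the division of labor: sign_judgment stays descriptive, while message_judgment layers an interpretation on top, exactly the inference that sign-judgment approaches defer to higher-order decision making.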
Research based upon FACS has also shown that facial actions can distinguish truth-tellers from liars at a much higher accuracy than naive subjects making subjective judgments of the same faces (Cohn and Schmidt, 2004; Fasel et al., 2004). It is not surprising, therefore, that automatic Action Unit (AU) coding in face images and face image sequences has attracted the interest of computer vision researchers.

Historically, the first attempts to automatically encode AUs in images of faces were reported by Bartlett et al. (2006), Lien et al. (1998), and Pantic et al. (1998). These three research groups are still the forerunners in this research field. The focus of the research efforts in the field was first on automatic recognition of AUs in either static face images or face image sequences picturing facial expressions produced on command. Several promising prototype systems were reported that can recognize deliberately produced AUs in either (near-)frontal-view face images (Anderson and McOwan, 2006; Samal and Iyengar, 1992; Pantic and Rothkrantz, 2003) or profile-view face images (Pantic and Rothkrantz, 2003; Pantic and Patras, 2005). These systems employ different approaches, including expert rules and machine learning methods such as neural networks, and use either feature-based image representations (i.e., geometric features like facial points, see Sect. 2.3) or appearance-based image representations (i.e., the texture of the facial skin, including wrinkles and furrows, see Sect. 2.3).

One of the main criticisms these works received from both cognitive and computer scientists is that the methods are not applicable in real-life situations, where subtle changes in facial expression typify the displayed facial behavior rather than the exaggerated changes that typify posed expressions. Hence, the focus of the research in the field started to shift to automatic AU recognition in spontaneous facial expressions (produced in a reflex-like manner). Several works have recently emerged on machine analysis of AUs in spontaneous facial expression data (e.g., Cohn, 2006; Bartlett et al., 1999; Valstar and Pantic, 2006). These methods employ probabilistic, statistical, and ensemble learning techniques, which seem to be particularly suitable for automatic AU recognition from face image sequences (see, e.g., Tian et al., 2001; Lien et al., 1998).
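As a concrete illustration of the feature-based route, here is a minimal sketch of a per-AU detector, assuming facial landmarks have already been localized by some tracker. The landmark indexing, features, and classifier choice are illustrative conventions, not a reconstruction of any of the cited systems.

```python
# Minimal sketch of a geometric-feature-based AU detector. Assumes facial
# landmarks are already available (e.g., from a point tracker). One binary
# classifier per AU, as is common in the literature; all details illustrative.
import numpy as np
from sklearn.svm import SVC

def geometric_features(landmarks):
    """Turn an (N, 2) array of landmark coordinates into a vector of pairwise
    distances, normalized by inter-ocular distance for scale invariance.
    Assumes landmarks[0] and landmarks[1] are the eye centers (illustrative)."""
    iod = np.linalg.norm(landmarks[0] - landmarks[1]) + 1e-8
    n = len(landmarks)
    return np.array([np.linalg.norm(landmarks[i] - landmarks[j]) / iod
                     for i in range(n) for j in range(i + 1, n)])

def train_au_detector(landmark_sets, au_present):
    """Train a binary SVM for a single AU (e.g., AU12) from labeled examples."""
    X = np.stack([geometric_features(lm) for lm in landmark_sets])
    return SVC(kernel="rbf", probability=True).fit(X, au_present)

# Usage with synthetic data (20 faces, 10 landmarks each):
rng = np.random.default_rng(0)
faces = [rng.normal(size=(10, 2)) for _ in range(20)]
labels = np.arange(20) % 2            # 1 = AU active in that face (toy labels)
clf = train_au_detector(faces, labels)
print(clf.predict([geometric_features(faces[0])]))
```

An appearance-based system would replace geometric_features with a texture descriptor of the facial skin (e.g., filter-bank responses capturing wrinkles and furrows); the classification stage stays structurally the same.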
1.2 Facial Expression Configuration and Dynamics

In research on automatic AU coding, automatic recognition of facial expression configuration (in terms of the AUs constituting the observed expression) has been the main focus of effort. However, both the configuration and the dynamics of facial expressions (i.e., the timing and the duration of various AUs) are important for interpreting human facial behavior. The body of research in cognitive sciences arguing that the dynamics of facial expressions are crucial for interpreting the observed behavior is ever growing (Ekman et al., 1993; Lee and Kim, 1999). Facial expression temporal dynamics are essential for the categorization of complex psychological states like various types of pain and mood; they represent a critical factor in the interpretation of social behaviors like social inhibition, embarrassment, amusement, and shame; and they are a key parameter in differentiating between posed and spontaneous facial displays (Ekman et al., 1993).

For instance, spontaneous smiles are smaller in amplitude, longer in total duration, and slower in onset and offset time than posed smiles (e.g., a polite smile) (Cohn and Schmidt, 2004). Another study showed that spontaneous smiles, in contrast to posed smiles, can have multiple apexes (multiple rises of the mouth corners – AU12) and are accompanied by other AUs that appear either simultaneously with AU12 or follow AU12 within 1 s (Cohn et al., 2004). Similarly, it has been shown that the differences between spontaneous and deliberately displayed brow actions (AU1, AU2, AU4) lie in the duration and speed of the onset and offset of the actions and in the order and timing of the actions' occurrences (Valstar and Pantic, 2006).

In spite of these findings, the vast majority of past work in the field does not take the dynamics of facial expressions into account when analyzing shown facial behavior. Some past work has used aspects of the temporal dynamics of facial expression, such as the speed of a facial point's displacement or the persistence of facial parameters over time (e.g., Lien et al., 1998). However, only three recent studies analyze explicitly the temporal dynamics of facial expressions. These studies explore automatic segmentation of AU activation into temporal segments (neutral, onset, apex, offset) in frontal- (Pantic and Bartlett, 2007; Tian et al., 2005) and profile-view (Pantic and Patras, 2005) face videos.

1.3 Facial Expression Intensity and Context Dependency

Facial expressions can vary in intensity. By intensity we mean the relative degree of change in a facial expression as compared to a relaxed, neutral facial expression. It has been experimentally shown that the expression-decoding accuracy and the perceived intensity of the underlying affective state vary linearly with the physical intensity of the facial display (Gu and Ji, 2004). Hence, explicit analysis of expression intensity variation is very important for accurate expression interpretation and is also essential to the ability to distinguish between spontaneous and posed facial behavior discussed in the previous sections. While FACS provides a 5-point intensity scale to describe AU intensity variation and enable manual quantification of AU intensity (Ekman and Friesen, 1978), fully automated methods that accomplish this task are yet to be developed. However, first steps toward this goal have been made. Automatic coding of intensity variation was explicitly compared to manual coding by Bartlett et al. (1999), who found that the distance to the separating hyperplane in their learned classifiers correlated significantly with the intensity scores provided by expert FACS coders.
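Tying the last two subsections together, the following sketch assumes a per-frame AU intensity estimate is available (it could, for example, be the classifier-to-hyperplane distance just mentioned), segments it into neutral/onset/apex/offset, and measures onset speed, one of the cues reported to separate posed from spontaneous displays. The thresholds are illustrative, not calibrated values from the literature.

```python
# Sketch: segmenting a per-frame AU intensity signal into temporal segments
# (neutral / onset / apex / offset) and measuring onset speed. Thresholds and
# the synthetic signal are illustrative only.
import numpy as np

def segment_au(intensity, active_thr=0.2, slope_thr=0.02):
    """Label each frame by thresholding the signal and its per-frame slope."""
    slope = np.gradient(np.asarray(intensity, dtype=float))
    labels = []
    for v, s in zip(intensity, slope):
        if v < active_thr:
            labels.append("neutral")
        elif s > slope_thr:
            labels.append("onset")
        elif s < -slope_thr:
            labels.append("offset")
        else:
            labels.append("apex")
    return labels

def onset_speed(intensity, labels, fps=25.0):
    """Mean intensity change per second over the onset segment."""
    idx = [i for i, l in enumerate(labels) if l == "onset"]
    if len(idx) < 2:
        return 0.0
    rise = intensity[idx[-1]] - intensity[idx[0]]
    return rise / (len(idx) / fps)

# Usage on a synthetic smile-like profile (rise, plateau, decay over 4 s):
t = np.linspace(0, 4, 100)
signal = np.clip(np.sin(np.pi * t / 4), 0, None)
labels = segment_au(signal)
print(labels[10], labels[40], labels[80], onset_speed(signal, labels))
```

A slow onset (low onset_speed) together with long total duration would, per the findings above, point toward a spontaneous rather than posed display.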
Rapid facial signals do not usually convey exclusively one type of message. For instance, squinted eyes may be interpreted as sensitivity of the eyes to bright light if this action is a reflex (a manipulator), as an expression of disliking if this action is displayed when seeing someone pass by (an affective cue), or as an illustrator of friendly anger during friendly teasing if this action is posed (in contrast to being unintentionally displayed) during a chat with a friend, to mention just a few possibilities. As already mentioned, to interpret an observed facial expression it is important to know the context in which the observed expression has been displayed: where the expresser is (outside, inside, in the car, in the kitchen, etc.), what his or her current task is, whether other people are involved, and who the expresser is. Knowing the expresser is particularly important, as individuals often have characteristic facial expressions and may differ in the way certain states (other than the basic emotions) are expressed. Since the problem of context sensing is extremely difficult to solve (if possible at all) in the general case, pragmatic approaches (e.g., activity/application- and user-centered approaches) should be taken when learning the grammar of human facial behavior (Pantic et al., 1998; Pantic and Patras, 2006). However, except for a few works on user-profiled interpretation of facial expressions, like those of Fasel et al. (2004) and Pantic and Rothkrantz (2004a), virtually all existing automated facial expression analyzers are context insensitive.

1.4 Facial Expression Databases

To develop and evaluate facial behavior analyzers capable of dealing with the different dimensions of the problem space defined above, large collections of training and test data are needed (Pantic and Rothkrantz, 2000; Tian et al., 2001). A complete overview of existing, publicly available data sets that can be used in research on automatic facial expression analysis is given by Pantic and Bartlett (2007). We provide here a description of two relevant facial expression databases: the Cohn–Kanade database (Juslin and Scherer, 2005), which is the most widely used database in research on automated facial expression analysis, and the MMI facial expression database (Pantic et al., 2005a; Pantic, 2006), which represents the most comprehensive, online reference set of face images and videos of both deliberately and spontaneously displayed facial expressions.

2 The State of the Art

Although humans detect and analyze faces and facial expressions in a scene with little or no effort, developing an automated system that accomplishes this task is rather difficult. There are several related problems (Pantic et al., 2006). The first is to find faces in the scene independent of clutter, occlusions, and variations in head pose and lighting conditions. Then, geometric facial features such as facial salient points (e.g., the mouth corners) or parameters of an appearance-based facial model (e.g., parameters of a fitted active appearance model) should be extracted from the regions of the scene that contain faces. The system should perform this accurately, in a fully automatic manner, and preferably in real time. Eventually, the extracted facial information should be interpreted in terms of facial signals (winks, blinks, smiles, affective states, cognitive states, moods) in a context-dependent (personalized, task-, situation-, and application-dependent) manner.
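As a minimal sketch of the first two stages of this pipeline, the snippet below locates faces with OpenCV's classical Haar-cascade detector and crops the face region. The surveyed systems use a variety of (often stronger) detectors, and the geometric/appearance feature extraction is only stubbed here, since fitting, e.g., an active appearance model requires a trained model.

```python
# Sketch of the face-detection stage of the pipeline described above, using
# OpenCV's stock Haar cascade. Detector choice and parameters are illustrative.
import cv2

detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def find_faces(frame):
    """Return bounding boxes (x, y, w, h) of detected faces in a BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)          # crude robustness to lighting
    return detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

def extract_features(frame, box):
    """Placeholder for geometric/appearance feature extraction (e.g., fitting
    an active appearance model) on the face region; here it just crops."""
    x, y, w, h = box
    return frame[y:y + h, x:x + w]

cap = cv2.VideoCapture(0)                  # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    for box in find_faces(frame):
        face = extract_features(frame, box)
        # ...feed 'face' (or tracked facial points) to AU/affect classifiers...
        x, y, w, h = box
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```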
This section summarizes current research efforts in solving these problems and enumerates the scientific and engineering issues that arise in meeting these challenges.